Welcome back to deep learning. So today we want to talk a bit more about the ideas of object detection.
And we will start with a small motivation on object detection and some key ideas how this can actually be performed.
So let's have a look at our slides. You see this is already part three of our short lecture video series on segmentation and object detection.
And now the topic is object detection. Well, let's motivate this a little bit.
And the idea, as you remember, is that we want to localize objects and classify them.
So we want to figure out where the cats are in the image and we want to figure out whether they are actually cats.
And this is then commonly solved by generating hypotheses about bounding boxes.
Then you resample those boxes and apply a classifier.
So the main distinction between the methods is how you combine or replace those steps in order to achieve higher speed or accuracy.
So we can also look into our plane example.
And of course, we are looking for bounding boxes, and a bounding box is typically defined as the smallest box that fully contains the object in question.
And this is then typically described by a top left corner, a width w and a height h, plus some classifier confidence for the bounding box.
And you can see that we can also use this for detecting the entire plane or we could also use it for parts of the plane.
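To make the bounding box definition concrete, here is a minimal sketch in plain Python. The class name `BoundingBox`, the helper `tight_box`, and the toy binary mask are assumptions made for illustration only; they are not part of the lecture. The sketch computes the smallest axis-aligned box that contains all object pixels and stores it as top left corner, width, height, and confidence.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    x: int             # column of the top left corner
    y: int             # row of the top left corner
    w: int             # width in pixels
    h: int             # height in pixels
    confidence: float  # classifier confidence for this box


def tight_box(mask, confidence=1.0):
    """Smallest axis-aligned box containing every nonzero pixel of a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask contains no object pixels")
    return BoundingBox(x=cols[0], y=rows[0],
                       w=cols[-1] - cols[0] + 1,
                       h=rows[-1] - rows[0] + 1,
                       confidence=confidence)


# Hypothetical toy mask: 1 marks a pixel that belongs to the object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = tight_box(mask, confidence=0.9)
print(box)
```

The same representation works whether the mask covers the entire plane or only one of its parts.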
This is actually something where we already have a long history of different success stories.
One very early success story is the Viola and Jones algorithm that was one of the earliest really well working face detectors.
Here you can see the example: this was using Haar features, and those Haar features were then used in a kind of boosting cascade in order to detect faces very quickly.
So it was using large numbers of features that can be computed very efficiently.
And then in the boosting cascade, they would select only the features that are most likely to detect a face at the given position.
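As a rough illustration of why such features are cheap: in the Viola and Jones detector, rectangle sums are read from an integral image, so every rectangle costs four lookups regardless of its size. The sketch below is a simplified, assumed version in plain Python. The function names and the toy image are made up for illustration, and only a single two-rectangle feature is shown, not the boosting cascade.

```python
def integral_image(img):
    """ii[r][c] holds the sum of img over all rows < r and columns < c."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the rectangle with top left (x, y), width w, height h: four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Left half minus right half of a (2*w) x h window; responds to vertical edges."""
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h)


# Hypothetical toy image with a bright left half and a dark right half.
img = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
]
ii = integral_image(img)
feature = two_rect_feature(ii, x=0, y=0, w=2, h=3)
print(feature)
```

Because the integral image is computed once per image, a large pool of such features can be evaluated at every window position at low cost.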
This was then improved with the so-called histogram of oriented gradients.
And in classical methods, you always had a good feature extraction plus some effective classification.
And at that time, support vector machines were very common.
Of course, there are also neural-network-based approaches, and you can do this actually quite easily with a pre-trained CNN.
So you could simply use the sliding window approach and then classify each possible window with a CNN.
There are region proposal CNNs like R-CNN that find interesting image regions first and then classify them with a CNN.
And there's the single shot detectors that we talk about in the next video that do a joint detection and classification.
And I think one of the most famous examples is You Only Look Once, or YOLO. We already had that example in the introduction.
And I think it's a very nice method that performs this joint detection and classification in real time.
Well, in the sliding window approach, you could simply take your pre-trained CNN and just move it all across the image.
And then when you find an area of high confidence, you could say, OK, there is a face and I want to detect this face.
Now, the big disadvantage here is that you do not just have to detect the face, but the face could also appear at different resolutions.
So you want to repeat this process on multiple resolutions.
And then you already see that detecting patch-wise results in a large number of patches, and you have to run a large number of classifications.
So this is probably not the way that we want to go.
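To get a feeling for how quickly the number of patches grows, here is a small counting sketch in plain Python. The image size, window size, stride, and scale factors are assumed values chosen only for illustration, and the function names are hypothetical.

```python
def sliding_windows(img_w, img_h, win, stride):
    """Yield the top left corner (x, y) of every win x win window that fits."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield x, y


def count_patches(img_w, img_h, win, stride, scales):
    """Number of classifier calls when the image is rescaled by each factor in scales."""
    total = 0
    for s in scales:
        w, h = int(img_w * s), int(img_h * s)
        total += sum(1 for _ in sliding_windows(w, h, win, stride))
    return total


# Assumed setting: 640 x 480 image, 64 x 64 window, stride 8, three resolutions.
n = count_patches(640, 480, win=64, stride=8, scales=[1.0, 0.75, 0.5])
print(n)
```

Even in this modest assumed setting, a single image already requires several thousand CNN evaluations, and neighboring patches overlap heavily, so most of the computation is repeated.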
And one advantage, of course, would be that we don't need to retrain.
We can simply use our trained classification network in this case, but it's computationally very inefficient.
And we want to look for some ideas for how to do that more efficiently.
And one idea is, of course, that you use fully convolutional neural networks.
And this is then the following approach.
So we can think about the idea: how could we apply a fully connected layer to an arbitrarily shaped tensor?
And one idea would be you flatten the activations, then you run your fully connected layer.
And you can get one classification result.
Instead, you can reshape your weights and then use convolution.
And this will produce exactly the same result.
And we already discussed this when we were talking about one by one convolutions.
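The equivalence between the flattened fully connected layer and the reshaped convolution can be checked with a small sketch. This is an assumed toy setup in plain Python with random integer weights so that the comparison is exact. All function names are hypothetical, and "convolution" here means the valid cross-correlation that deep learning frameworks usually implement.

```python
import random


def flatten(t):
    """Flatten a C x H x W nested list in channel, row, column order."""
    return [v for channel in t for row in channel for v in row]


def fully_connected(x_flat, weights, bias):
    """One output per weight row: dot product with the flattened input plus bias."""
    return [sum(w * v for w, v in zip(w_row, x_flat)) + b
            for w_row, b in zip(weights, bias)]


def reshape_to_kernels(weights, C, H, W):
    """Turn each weight row of length C*H*W into a C x H x W kernel (same order as flatten)."""
    return [[[w_row[(c * H + y) * W:(c * H + y) * W + W] for y in range(H)]
             for c in range(C)]
            for w_row in weights]


def conv_valid(t, kernels, bias):
    """Valid cross-correlation with stride 1; returns a K x H_out x W_out nested list."""
    C, H, W = len(t), len(t[0]), len(t[0][0])
    kh, kw = len(kernels[0][0]), len(kernels[0][0][0])
    out = []
    for kern, b in zip(kernels, bias):
        fmap = [[b + sum(kern[c][y][x] * t[c][oy + y][ox + x]
                         for c in range(C) for y in range(kh) for x in range(kw))
                 for ox in range(W - kw + 1)]
                for oy in range(H - kh + 1)]
        out.append(fmap)
    return out


random.seed(0)
C, H, W, K = 2, 3, 3, 4  # assumed toy sizes: 2 channels, 3 x 3 activations, 4 classes
x = [[[random.randint(-3, 3) for _ in range(W)] for _ in range(H)] for _ in range(C)]
weights = [[random.randint(-3, 3) for _ in range(C * H * W)] for _ in range(K)]
bias = [random.randint(-3, 3) for _ in range(K)]

fc_out = fully_connected(flatten(x), weights, bias)
conv_out = conv_valid(x, reshape_to_kernels(weights, C, H, W), bias)

# The kernel covers the whole input, so each class gets a single 1 x 1 output.
assert fc_out == [fmap[0][0] for fmap in conv_out]
```

The only requirement is that the reshaping of the weights uses the same ordering as the flattening of the activations.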
Now, the nice property here is that if you follow this idea, we can then also work with arbitrarily shaped spatial input sizes.
And by using the convolution, we will then also produce a larger output size.
So we kind of get away from the explicit moving window approach.
But we still have the problem that we would have to look at the different scales in order to find all of the interesting objects.
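The effect of a larger input can be sketched in the same spirit. In the following assumed single-channel toy example in plain Python, a classifier with hypothetical weights for 2 x 2 inputs is applied as a convolution to a 4 x 4 image. The output becomes a 3 x 3 map of scores, and each entry equals what the original classifier would return for the corresponding crop.

```python
def score_map(image, kernel, bias=0):
    """Slide a kh x kw kernel over a 2-D image (valid cross-correlation, stride 1)."""
    H, W = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [[bias + sum(kernel[y][x] * image[oy + y][ox + x]
                        for y in range(kh) for x in range(kw))
             for ox in range(W - kw + 1)]
            for oy in range(H - kh + 1)]


def classify_crop(crop, kernel, bias=0):
    """The 'fully connected' view: one score for one crop of the trained size."""
    flat_x = [v for row in crop for v in row]
    flat_w = [v for row in kernel for v in row]
    return bias + sum(w * v for w, v in zip(flat_w, flat_x))


kernel = [[1, -1],
          [2, 0]]        # hypothetical weights trained for 2 x 2 inputs
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 8, 7, 6],
         [5, 4, 3, 2]]   # larger 4 x 4 input

scores = score_map(image, kernel)  # 3 x 3 map: one score per window position

# Every entry matches the explicit sliding window classification of that crop.
for oy in range(3):
    for ox in range(3):
        crop = [row[ox:ox + 2] for row in image[oy:oy + 2]]
        assert classify_crop(crop, kernel) == scores[oy][ox]
```

Note that the window size is still fixed by the kernel, which is why objects at other scales are not covered by this trick alone.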
This is why people have been looking at region-based detectors.
So we know that CNNs are powerful classifiers, and fully convolutional neural networks help to improve the efficiency, but they are still somewhat brute force.
Access: Open access
Duration: 00:13:52 min
Recording date: 2020-10-12
Uploaded on: 2020-10-12 22:26:27
Language: en-US
Deep Learning - Segmentation and Object Detection Part 3
In this video, we start looking into object detection. We start with classical ideas, re-visit the concept of a fully convolutional neural network, and start developing a fast regional CNN detector which finally leads to Faster RCNN.